Nobel_Prize_Analysis¶

Introduction¶

On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.

Alfred Nobel dictated that his entire remaining estate should be used to endow “prizes to those who, during the preceding year, have conferred the greatest benefit to humankind”.

Every year the Nobel Prize is given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace.

Let's see what patterns we can find in the data of the past Nobel laureates. What can we learn about the Nobel prize and our world more generally?

In [1]:
# pip install --upgrade plotly
In [2]:
# %pip install --upgrade plotly

Import Statements¶

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

Notebook Presentation¶

In [4]:
pd.options.display.float_format = '{:,.2f}'.format

Read the Data¶

In [5]:
df_data = pd.read_csv('nobel_prize_data.csv')

Caveats: The exact birth dates for Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted them with a mid-year estimate of July 2nd.

Data Exploration & Cleaning¶

Preliminary data exploration.

  • What is the shape of df_data? How many rows and columns?
  • What are the column names?
  • In which year was the Nobel prize first awarded?
  • Which year is the latest year included in the dataset?
In [6]:
df_data.head(5)
Out[6]:
year category prize motivation prize_share laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO
0 1901 Chemistry The Nobel Prize in Chemistry 1901 "in recognition of the extraordinary services ... 1/1 Individual Jacobus Henricus van 't Hoff 1852-08-30 Rotterdam Netherlands Netherlands Male Berlin University Berlin Germany NLD
1 1901 Literature The Nobel Prize in Literature 1901 "in special recognition of his poetic composit... 1/1 Individual Sully Prudhomme 1839-03-16 Paris France France Male NaN NaN NaN FRA
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its... 1/1 Individual Emil Adolf von Behring 1854-03-15 Hansdorf (Lawice) Prussia (Poland) Poland Male Marburg University Marburg Germany POL
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Frédéric Passy 1822-05-20 Paris France France Male NaN NaN NaN FRA
4 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Jean Henry Dunant 1828-05-08 Geneva Switzerland Switzerland Male NaN NaN NaN CHE
In [7]:
df_data.shape
Out[7]:
(962, 16)
In [8]:
df_data.columns
Out[8]:
Index(['year', 'category', 'prize', 'motivation', 'prize_share',
       'laureate_type', 'full_name', 'birth_date', 'birth_city',
       'birth_country', 'birth_country_current', 'sex', 'organization_name',
       'organization_city', 'organization_country', 'ISO'],
      dtype='object')
In [9]:
min(df_data.year)
Out[9]:
1901
In [10]:
max(df_data.year)
Out[10]:
2020

Check for Duplicates¶

  • Are there any duplicate values in the dataset?
  • Are there NaN values in the dataset?
  • Which columns tend to have NaN values?
  • How many NaN values are there per column?
  • Why do these columns have NaN values?
In [11]:
df_data.duplicated().sum()
Out[11]:
0

Check for NaN Values¶

In [12]:
print(f'Does the data have NaN values? {df_data.isna().sum().any()}. Total NaN count: {df_data.isna().sum().sum()}')
Does the data have NaN values? True. Total NaN count: 1023
In [13]:
df_data.isna().sum()
Out[13]:
year                       0
category                   0
prize                      0
motivation                88
prize_share                0
laureate_type              0
full_name                  0
birth_date                28
birth_city                31
birth_country             28
birth_country_current     28
sex                       28
organization_name        255
organization_city        255
organization_country     254
ISO                       28
dtype: int64
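
One way to approach the question of why these columns are empty is to break the missing values down by laureate_type. The sketch below uses a small hypothetical frame (the name `toy` and its rows are illustrative assumptions, not an extract of nobel_prize_data.csv) and counts the missing values per laureate type.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, shaped like the relevant columns of df_data
toy = pd.DataFrame({
    'laureate_type': ['Individual', 'Individual', 'Organization', 'Organization'],
    'birth_date': ['1852-08-30', '1839-03-16', np.nan, np.nan],
    'organization_name': ['Berlin University', np.nan, np.nan, np.nan],
})

# Flag missing cells, then sum the flags within each laureate type
missing_by_type = (
    toy[['birth_date', 'organization_name']]
    .isna()
    .groupby(toy['laureate_type'])
    .sum()
)
print(missing_by_type)
```

Applied to df_data, a breakdown like this should make it easy to see whether the missing birth details come from organizations, which have no birth date, birth place, or sex.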

Type Conversions¶

  • Converting the birth_date column to Pandas Datetime objects
  • Adding a Column called share_pct which has the laureates' share as a percentage in the form of a floating-point number.

Convert Birth Date to Datetime¶

In [14]:
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   year                   962 non-null    int64 
 1   category               962 non-null    object
 2   prize                  962 non-null    object
 3   motivation             874 non-null    object
 4   prize_share            962 non-null    object
 5   laureate_type          962 non-null    object
 6   full_name              962 non-null    object
 7   birth_date             934 non-null    object
 8   birth_city             931 non-null    object
 9   birth_country          934 non-null    object
 10  birth_country_current  934 non-null    object
 11  sex                    934 non-null    object
 12  organization_name      707 non-null    object
 13  organization_city      707 non-null    object
 14  organization_country   708 non-null    object
 15  ISO                    934 non-null    object
dtypes: int64(1), object(15)
memory usage: 120.4+ KB
In [15]:
df_data.birth_date = pd.to_datetime(df_data.birth_date)
In [16]:
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   year                   962 non-null    int64         
 1   category               962 non-null    object        
 2   prize                  962 non-null    object        
 3   motivation             874 non-null    object        
 4   prize_share            962 non-null    object        
 5   laureate_type          962 non-null    object        
 6   full_name              962 non-null    object        
 7   birth_date             934 non-null    datetime64[ns]
 8   birth_city             931 non-null    object        
 9   birth_country          934 non-null    object        
 10  birth_country_current  934 non-null    object        
 11  sex                    934 non-null    object        
 12  organization_name      707 non-null    object        
 13  organization_city      707 non-null    object        
 14  organization_country   708 non-null    object        
 15  ISO                    934 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(14)
memory usage: 120.4+ KB

Filtering on the NaN values

In [17]:
col_subset = ['year','category', 'laureate_type',
              'birth_date','full_name', 'organization_name']
df_data.loc[df_data.birth_date.isna()][col_subset]
Out[17]:
year category laureate_type birth_date full_name organization_name
24 1904 Peace Organization NaT Institut de droit international (Institute of ... NaN
60 1910 Peace Organization NaT Bureau international permanent de la Paix (Per... NaN
89 1917 Peace Organization NaT Comité international de la Croix Rouge (Intern... NaN
200 1938 Peace Organization NaT Office international Nansen pour les Réfugiés ... NaN
215 1944 Peace Organization NaT Comité international de la Croix Rouge (Intern... NaN
237 1947 Peace Organization NaT American Friends Service Committee (The Quakers) NaN
238 1947 Peace Organization NaT Friends Service Council (The Quakers) NaN
283 1954 Peace Organization NaT Office of the United Nations High Commissioner... NaN
348 1963 Peace Organization NaT Comité international de la Croix Rouge (Intern... NaN
349 1963 Peace Organization NaT Ligue des Sociétés de la Croix-Rouge (League o... NaN
366 1965 Peace Organization NaT United Nations Children's Fund (UNICEF) NaN
399 1969 Peace Organization NaT International Labour Organization (I.L.O.) NaN
479 1977 Peace Organization NaT Amnesty International NaN
523 1981 Peace Organization NaT Office of the United Nations High Commissioner... NaN
558 1985 Peace Organization NaT International Physicians for the Prevention of... NaN
588 1988 Peace Organization NaT United Nations Peacekeeping Forces NaN
659 1995 Peace Organization NaT Pugwash Conferences on Science and World Affairs NaN
682 1997 Peace Organization NaT International Campaign to Ban Landmines (ICBL) NaN
703 1999 Peace Organization NaT Médecins Sans Frontières NaN
730 2001 Peace Organization NaT United Nations (U.N.) NaN
778 2005 Peace Organization NaT International Atomic Energy Agency (IAEA) NaN
788 2006 Peace Organization NaT Grameen Bank NaN
801 2007 Peace Organization NaT Intergovernmental Panel on Climate Change (IPCC) NaN
860 2012 Peace Organization NaT European Union (EU) NaN
873 2013 Peace Organization NaT Organisation for the Prohibition of Chemical W... NaN
897 2015 Peace Organization NaT National Dialogue Quartet NaN
919 2017 Peace Organization NaT International Campaign to Abolish Nuclear Weap... NaN
958 2020 Peace Organization NaT World Food Programme (WFP) NaN

Rows where the organization_name column has no value

In [18]:
col_subset = ['year','category', 'laureate_type','full_name', 'organization_name']
df_data.loc[df_data.organization_name.isna()][col_subset]
Out[18]:
year category laureate_type full_name organization_name
1 1901 Literature Individual Sully Prudhomme NaN
3 1901 Peace Individual Frédéric Passy NaN
4 1901 Peace Individual Jean Henry Dunant NaN
7 1902 Literature Individual Christian Matthias Theodor Mommsen NaN
9 1902 Peace Individual Charles Albert Gobat NaN
... ... ... ... ... ...
932 2018 Peace Individual Nadia Murad NaN
942 2019 Literature Individual Peter Handke NaN
946 2019 Peace Individual Abiy Ahmed Ali NaN
954 2020 Literature Individual Louise Glück NaN
958 2020 Peace Organization World Food Programme (WFP) NaN

255 rows × 5 columns

Adding a Column with the Prize Share as a Percentage¶

In [19]:
# Split '1/2' into numerator and denominator, then divide to get the share
separated_values = df_data.prize_share.str.split('/', expand=True)
numerator = pd.to_numeric(separated_values[0])
denominator = pd.to_numeric(separated_values[1])
df_data['share_pct'] = numerator / denominator
In [20]:
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   year                   962 non-null    int64         
 1   category               962 non-null    object        
 2   prize                  962 non-null    object        
 3   motivation             874 non-null    object        
 4   prize_share            962 non-null    object        
 5   laureate_type          962 non-null    object        
 6   full_name              962 non-null    object        
 7   birth_date             934 non-null    datetime64[ns]
 8   birth_city             931 non-null    object        
 9   birth_country          934 non-null    object        
 10  birth_country_current  934 non-null    object        
 11  sex                    934 non-null    object        
 12  organization_name      707 non-null    object        
 13  organization_city      707 non-null    object        
 14  organization_country   708 non-null    object        
 15  ISO                    934 non-null    object        
 16  share_pct              962 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(14)
memory usage: 127.9+ KB
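
A quick sanity check on the new column, sketched here on hypothetical toy rows (the frame `toy` and its prize labels are assumptions for illustration): the shares belonging to one prize should add up to exactly one whole prize.

```python
import pandas as pd

# Hypothetical toy rows mirroring the prize and prize_share columns
toy = pd.DataFrame({
    'prize': ['Peace 1901', 'Peace 1901', 'Chemistry 1901'],
    'prize_share': ['1/2', '1/2', '1/1'],
})

# Same split-and-divide approach as in the cell above
parts = toy['prize_share'].str.split('/', expand=True)
toy['share_pct'] = pd.to_numeric(parts[0]) / pd.to_numeric(parts[1])

# Shares of the same prize should sum to 1
totals = toy.groupby('prize')['share_pct'].sum()
print(totals)
```

Note that share_pct stores a fraction between 0 and 1 (a half share is 0.5); multiplying by 100 would give a percentage in the usual sense.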

Plotly Donut Chart: Percentage of Male vs. Female Laureates¶

Creating a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?

In [21]:
biology = df_data.sex.value_counts()
# names= sets the slice labels; the labels= argument is a column-renaming
# dict and is not needed here
fig = px.pie(names=biology.index,
             values=biology.values,
             title="Percentage of Male vs. Female Winners",
             hole=0.6)

fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')

fig.show()

Who were the first 3 Women to Win the Nobel Prize?¶

  • What are the names of the first 3 female Nobel laureates?
  • What did they win the prize for?
  • What do you see in their birth_country? Were they part of an organisation?
In [22]:
# df_data[df_data.sex == 'Female'].sort_values('year', ascending=True)[:3]
df_data.sort_values(by='year',ascending=True).query('sex=="Female"').head(3)[['full_name','prize','year','birth_country']]
Out[22]:
    full_name                                          prize                               year  birth_country
18  Marie Curie, née Sklodowska                        The Nobel Prize in Physics 1903     1903  Russian Empire (Poland)
29  Baroness Bertha Sophie Felicita von Suttner, n...  The Nobel Peace Prize 1905          1905  Austrian Empire (Czech Republic)
51  Selma Ottilia Lovisa Lagerlöf                      The Nobel Prize in Literature 1909  1909  Sweden

Find the Repeat Winners¶

Did some people get a Nobel Prize more than once? If so, who were they?

In [23]:
is_winner = df_data.duplicated(subset = ['full_name'] , keep=False)
multiple_winners = df_data[is_winner]
print(f'There are {multiple_winners.full_name.nunique()} winners who were awarded the prize more than once.')
There are 6 winners who were awarded the prize more than once.
In [24]:
col_subset = ['year', 'category', 'laureate_type', 'full_name']
multiple_winners[col_subset]
Out[24]:
     year  category   laureate_type  full_name
18   1903  Physics    Individual     Marie Curie, née Sklodowska
62   1911  Chemistry  Individual     Marie Curie, née Sklodowska
89   1917  Peace      Organization   Comité international de la Croix Rouge (Intern...
215  1944  Peace      Organization   Comité international de la Croix Rouge (Intern...
278  1954  Chemistry  Individual     Linus Carl Pauling
283  1954  Peace      Organization   Office of the United Nations High Commissioner...
297  1956  Physics    Individual     John Bardeen
306  1958  Chemistry  Individual     Frederick Sanger
340  1962  Peace      Individual     Linus Carl Pauling
348  1963  Peace      Organization   Comité international de la Croix Rouge (Intern...
424  1972  Physics    Individual     John Bardeen
505  1980  Chemistry  Individual     Frederick Sanger
523  1981  Peace      Organization   Office of the United Nations High Commissioner...
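
An alternative to duplicated() is to count the rows per name and keep the names that appear more than once. The sketch below is a minimal example on a few rows; the frame `toy` is an illustrative assumption. It also assumes that full_name identifies a laureate uniquely, so two different laureates with the same name would be merged.

```python
import pandas as pd

# A handful of (year, full_name) pairs for illustration
toy = pd.DataFrame({
    'year': [1903, 1911, 1901, 1954, 1962],
    'full_name': ['Marie Curie, née Sklodowska', 'Marie Curie, née Sklodowska',
                  'Sully Prudhomme', 'Linus Carl Pauling', 'Linus Carl Pauling'],
})

# Number of awards per name, keeping only names with more than one award
wins = toy.groupby('full_name')['year'].count()
repeat_winners = wins[wins > 1].sort_values(ascending=False)
print(repeat_winners)
```

This form also reports how many times each laureate won, which matters for the organization that appears three times in the table above.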

Number of Prizes per Category¶

  • In how many categories are prizes awarded?
  • Which category has the most number of prizes awarded?
  • Which category has the fewest number of prizes awarded?
In [25]:
df_data.category.nunique()
Out[25]:
6
In [26]:
categoryData = df_data.category.value_counts()
# The counts are passed directly as arrays, so no data_frame argument is needed
fig = px.bar(x=categoryData.index,
             y=categoryData.values,
             color=categoryData.values,
             color_continuous_scale='Aggrnyl',
             title='Number of Prizes per Category')
fig.update_layout(xaxis_title='Nobel Prize Category',
                  yaxis_title='Number of Prizes',
                  coloraxis_showscale=False)
fig.show()
  • When was the first prize in the field of Economics awarded?
  • Who did the prize go to?
In [27]:
df_data[df_data.category=='Economics'].sort_values(by='year' , ascending=True)[:3]
Out[27]:
year category prize motivation prize_share laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO share_pct
393 1969 Economics The Sveriges Riksbank Prize in Economic Scienc... "for having developed and applied dynamic mode... 1/2 Individual Jan Tinbergen 1903-04-12 the Hague Netherlands Netherlands Male The Netherlands School of Economics Rotterdam Netherlands NLD 0.50
394 1969 Economics The Sveriges Riksbank Prize in Economic Scienc... "for having developed and applied dynamic mode... 1/2 Individual Ragnar Frisch 1895-03-03 Oslo Norway Norway Male University of Oslo Oslo Norway NOR 0.50
402 1970 Economics The Sveriges Riksbank Prize in Economic Scienc... "for the scientific work through which he has ... 1/1 Individual Paul A. Samuelson 1915-05-15 Gary, IN United States of America United States of America Male Massachusetts Institute of Technology (MIT) Cambridge, MA United States of America USA 1.00

Male and Female Winners by Category¶

Creating a plotly bar chart that shows the split between men and women by category.

  • Hover over the bar chart. How many prizes went to women in Literature compared to Physics?
In [28]:
cat_men_women = df_data.groupby(['category','sex'] ,
                                as_index=False).agg({'prize':pd.Series.count})
cat_men_women.sort_values('prize',ascending =False , inplace = True)
In [29]:
cat_men_women
Out[29]:
    category    sex     prize
11  Physics     Male    212
7   Medicine    Male    210
1   Chemistry   Male    179
5   Literature  Male    101
9   Peace       Male    90
3   Economics   Male    84
8   Peace       Female  17
4   Literature  Female  16
6   Medicine    Female  12
0   Chemistry   Female  7
10  Physics     Female  4
2   Economics   Female  2
In [30]:
# 'sex' is a discrete colour, so the continuous colour-scale options are dropped
fig = px.bar(cat_men_women,
             x='category',
             y='prize',
             color='sex',
             title='Number of Prizes Awarded per Category split by Men and Women')
fig.update_layout(xaxis_title='Nobel Prize Category',
                  yaxis_title='Number of Prizes')
fig.show()

Number of Prizes Awarded Over Time¶

Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually.

  • the number of prizes awarded every year.
  • 5 year rolling average of the number of prizes.

  • Looking at the chart, did the first and second world wars have an impact on the number of prizes being given out?

  • What could be the reason for the trend in the chart?
In [31]:
prize_per_year = df_data.groupby(['year'] ,
                                as_index=False).agg({'prize':pd.Series.count})
prize_per_year
Out[31]:
year prize
0 1901 6
1 1902 7
2 1903 7
3 1904 6
4 1905 5
... ... ...
112 2016 11
113 2017 12
114 2018 13
115 2019 14
116 2020 12

117 rows × 2 columns

In [32]:
fig = px.scatter(prize_per_year,y = 'prize' ,
            x = 'year')
fig.update_layout(xaxis_title = 'year',
       yaxis_title = 'number of prizes',
       )
fig.show()
In [33]:
prize_per_year = df_data.groupby(by='year').count().prize
In [34]:
moving_average = prize_per_year.rolling(window=5).mean()
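
To make the behaviour of the rolling window concrete, here is a minimal sketch using only the first five yearly counts shown above (1901 to 1905). The names `counts`, `rolling_5`, and `rolling_partial` are illustrative; the min_periods variant is an optional alternative, not something the notebook uses.

```python
import pandas as pd

# Yearly prize counts for 1901-1905, as listed in the output above
counts = pd.Series([6, 7, 7, 6, 5], index=range(1901, 1906))

# A 5-year window needs five observations, so the first four results are NaN
rolling_5 = counts.rolling(window=5).mean()

# Optional: min_periods=1 averages over whatever is available so far
rolling_partial = counts.rolling(window=5, min_periods=1).mean()

print(rolling_5)
print(rolling_partial)
```

This is why the moving-average line in the chart below starts a few years after the first scatter points.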
In [35]:
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5), 
           fontsize=14, 
           rotation=45)
 
ax = plt.gca() # get current axis
ax.set_xlim(1900, 2020)
plt.scatter(x=prize_per_year.index, 
            y=prize_per_year.values,
           c='dodgerblue',
           alpha=0.7,
           s=100,)
 
plt.plot(prize_per_year.index, 
        moving_average.values, 
        c='crimson', 
        linewidth=3,)
 
plt.show()

Are More Prizes Shared Than Before?¶

Investigating if more prizes are shared than before.

  • the average prize share of the winners on a year by year basis.
  • the 5 year rolling average of the percentage share.
In [36]:
yearly_avg_share = df_data.groupby(by='year').agg({'share_pct': pd.Series.mean})
share_moving_average = yearly_avg_share.rolling(window=5).mean()
In [37]:
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5), 
           fontsize=14, 
           rotation=45)
 
ax1 = plt.gca()
ax2 = ax1.twinx() # create second y-axis
ax1.set_xlim(1900, 2020)
 
ax1.scatter(x=prize_per_year.index, 
           y=prize_per_year.values, 
           c='dodgerblue',
           alpha=0.7,
           s=100,)
 
ax1.plot(prize_per_year.index, 
        moving_average.values, 
        c='crimson', 
        linewidth=3,)
 
# Adding prize share plot on second axis
ax2.plot(prize_per_year.index, 
        share_moving_average.values, 
        c='grey', 
        linewidth=3,)
 
plt.show()
In [38]:
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5), 
           fontsize=14, 
           rotation=45)
 
ax1 = plt.gca()
ax2 = ax1.twinx()
ax1.set_xlim(1900, 2020)
 
# Can invert axis
ax2.invert_yaxis()
 
ax1.scatter(x=prize_per_year.index, 
           y=prize_per_year.values, 
           c='dodgerblue',
           alpha=0.7,
           s=100,)
 
ax1.plot(prize_per_year.index, 
        moving_average.values, 
        c='crimson', 
        linewidth=3,)
 
ax2.plot(prize_per_year.index, 
        share_moving_average.values, 
        c='grey', 
        linewidth=3,)
 
plt.show()

The Countries with the Most Nobel Prizes¶

  • Creating a DataFrame called top20_countries with two columns. The prize column contains the total number of prizes won.

  • What is the ranking for the top 20 countries in terms of the number of prizes?

In [39]:
top20_countries = df_data.groupby(by = 'birth_country_current',
                                as_index=False).agg({'prize':pd.Series.count})
# top20_countries
top20_countries = top20_countries.sort_values('prize')[-20:]#to plot correctly
In [40]:
# plotly creates its own figure, so no matplotlib figure is needed here
fig = px.bar(top20_countries,
             x='prize',
             y='birth_country_current',
             orientation='h',
             color='prize',
             color_continuous_scale='Aggrnyl',
             title='Top 20 Countries by Number of Prizes')

fig.update_layout(xaxis_title='Number of Prizes',
                  yaxis_title='Country',
                  coloraxis_showscale=False)

fig.show()

Choropleth Map to Show the Number of Prizes Won by Country¶

In [41]:
df_countries = df_data.groupby(['birth_country_current', 'ISO'], 
                               as_index=False).agg({'prize': pd.Series.count})
df_countries.sort_values('prize', ascending=False)
Out[41]:
    birth_country_current     ISO  prize
74  United States of America  USA  281
73  United Kingdom            GBR  105
26  Germany                   DEU  84
25  France                    FRA  57
67  Sweden                    SWE  29
..  ...                       ...  ...
32  Iceland                   ISL  1
47  Madagascar                MDG  1
34  Indonesia                 IDN  1
36  Iraq                      IRQ  1
78  Zimbabwe                  ZWE  1

79 rows × 3 columns

In [42]:
fig = px.choropleth(df_countries, locations='ISO',  color='prize',
                           color_continuous_scale=px.colors.sequential.matter,
                           range_color=(0, 250),
                    hover_name='birth_country_current',
                        )
# fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In Which Categories are the Different Countries Winning Prizes?¶

Dividing up the plotly bar chart created above to show which categories make up the total number of prizes.

  • In which category are Germany and Japan the weakest compared to the United States?
  • In which category does Germany have more prizes than the UK?
  • In which categories does France have more prizes than Germany?
  • Which category makes up most of Australia's Nobel prizes?
  • Which category makes up half of the prizes in the Netherlands?
  • Does the United States have more prizes in Economics than all of France? What about in Physics or Medicine?
In [43]:
df_categories = df_data.groupby(['birth_country_current', 'category'], 
                               as_index=False).agg({'prize': pd.Series.count  })
df_categories.sort_values('prize', ascending=False)
Out[43]:
birth_country_current category prize
204 United States of America Medicine 78
206 United States of America Physics 70
201 United States of America Chemistry 55
202 United States of America Economics 49
198 United Kingdom Medicine 28
... ... ... ...
97 Iraq Peace 1
99 Ireland Medicine 1
100 Ireland Physics 1
102 Israel Economics 1
210 Zimbabwe Peace 1

211 rows × 3 columns

In [44]:
merged_df = pd.merge(df_categories , top20_countries , on='birth_country_current')

merged_df.columns = ['birth_country_current' , 'category','cat_prize','total_prize']
merged_df.sort_values(by='total_prize' , inplace=True)
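
The merge step can be sketched on toy data as follows. The frames `per_category` and `totals` and the country labels A, B, C are hypothetical. The sketch uses the suffixes argument to name the two clashing prize columns explicitly, as an alternative to assigning the column names by position afterwards.

```python
import pandas as pd

# Hypothetical per-country, per-category counts
per_category = pd.DataFrame({
    'birth_country_current': ['A', 'A', 'B', 'C'],
    'category': ['Physics', 'Peace', 'Physics', 'Peace'],
    'prize': [3, 1, 2, 1],
})

# Hypothetical totals for the "top" countries only (C is not included)
totals = pd.DataFrame({
    'birth_country_current': ['A', 'B'],
    'prize': [4, 2],
})

# The default inner join keeps only countries present in both frames
merged = pd.merge(per_category, totals,
                  on='birth_country_current',
                  suffixes=('_cat', '_total'))
print(merged)
```

Because the join is an inner join, merging against the top-20 table also acts as the filter that drops every country outside the top 20.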
In [45]:
cat_cntry_bar = px.bar(x=merged_df.cat_prize,
                       y=merged_df.birth_country_current,
                       color=merged_df.category,
                       orientation='h',
                       title='Top 20 Countries by Number of Prizes and Category')
 
cat_cntry_bar.update_layout(xaxis_title='Number of Prizes', 
                            yaxis_title='Country')
cat_cntry_bar.show()

Number of Prizes Won by Each Country Over Time¶

  • When did the United States eclipse every other country in terms of the number of prizes won?
  • Which country or countries were leading previously?
  • the cumulative number of prizes won by each country in every year.
In [46]:
prize_by_year = df_data.groupby(by=['birth_country_current', 'year'], as_index=False).count()
prize_by_year = prize_by_year.sort_values('year')[['year', 'birth_country_current', 'prize']]
In [47]:
cumulative_prizes = prize_by_year.groupby(by=['birth_country_current',
                                              'year']).sum().groupby(level=[0]).cumsum()
cumulative_prizes.reset_index(inplace=True)
cumulative_prizes
Out[47]:
     birth_country_current     year  prize
0    Algeria                   1957  1
1    Algeria                   1997  2
2    Argentina                 1936  1
3    Argentina                 1947  2
4    Argentina                 1980  3
...  ...                       ...   ...
622  United States of America  2020  281
623  Venezuela                 1980  1
624  Vietnam                   1973  1
625  Yemen                     2011  1
626  Zimbabwe                  1960  1

627 rows × 3 columns
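
The running total per country can be sketched on toy data like this. The frame `toy` and the labels A and B are hypothetical; the sketch counts rows per country and year, then takes a cumulative sum within each country.

```python
import pandas as pd

# Hypothetical laureate rows: one row per prize share awarded
toy = pd.DataFrame({
    'birth_country_current': ['A', 'B', 'A', 'A', 'B'],
    'year': [1901, 1901, 1903, 1903, 1910],
})

# Prizes per country per year
yearly = (toy.groupby(['birth_country_current', 'year'])
             .size()
             .reset_index(name='prize'))

# Running total within each country, in year order
yearly['cumulative'] = yearly.groupby('birth_country_current')['prize'].cumsum()
print(yearly)
```

A country only has a row in the years in which it won something, so the line chart connects those points and skips the years in between.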

In [48]:
l_chart = px.line(cumulative_prizes,
                  x='year', 
                  y='prize',
                  color='birth_country_current',
                  hover_name='birth_country_current')
 
l_chart.update_layout(xaxis_title='Year',
                      yaxis_title='Number of Prizes')
 
l_chart.show()

What are the Top Research Organisations?¶

Creating a bar chart showing the organisations affiliated with the Nobel laureates.

  • Which organisations make up the top 20?
  • How many Nobel prize winners are affiliated with the University of Chicago and Harvard University?
In [49]:
top20_org = df_data.groupby(by = 'organization_name',
                                as_index=False).agg({'prize':pd.Series.count})
# # top20_countries
top20_org = top20_org.sort_values('prize')[-20:]#to plot correctly
In [50]:
# plotly creates its own figure, so no matplotlib figure is needed here
fig = px.bar(top20_org,
             x='prize',
             y='organization_name',
             orientation='h',
             color='prize',
             color_continuous_scale='Aggrnyl',
             title='Top 20 Organisations by Number of Prizes')

fig.update_layout(xaxis_title='Number of Prizes',
                  yaxis_title='Organisation',
                  coloraxis_showscale=False)

fig.show()

Which Cities Make the Most Discoveries?¶

Where do major discoveries take place?

  • Creating another plotly bar chart graphing the top 20 organisation cities of the research institutions associated with a Nobel laureate.
  • Where is the number one hotspot for discoveries in the world?
  • Which city in Europe has had the most discoveries?
In [51]:
top20_org_city = df_data.groupby(by = 'organization_city',
                                as_index=False).agg({'prize':pd.Series.count})
# # top20_countries
top20_org_city = top20_org_city.sort_values('prize')[-20:]#to plot correctly
# plotly creates its own figure, so no matplotlib figure is needed here
fig = px.bar(top20_org_city,
             x='prize',
             y='organization_city',
             orientation='h',
             color='prize',
             color_continuous_scale='Aggrnyl',
             title='Top 20 Organisation Cities by Number of Prizes')

fig.update_layout(xaxis_title='Number of Prizes',
                  yaxis_title='Organisation City',
                  coloraxis_showscale=False)

fig.show()

Where are Nobel Laureates Born? Chart the Laureate Birth Cities¶

  • Creating a plotly bar chart graphing the top 20 birth cities of Nobel laureates.
  • What percentage of the United States prizes came from Nobel laureates born in New York?
  • How many Nobel laureates were born in London, Paris and Vienna?
  • Out of the top 5 cities, how many are in the United States?
In [52]:
df_data.head()
Out[52]:
year category prize motivation prize_share laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO share_pct
0 1901 Chemistry The Nobel Prize in Chemistry 1901 "in recognition of the extraordinary services ... 1/1 Individual Jacobus Henricus van 't Hoff 1852-08-30 Rotterdam Netherlands Netherlands Male Berlin University Berlin Germany NLD 1.00
1 1901 Literature The Nobel Prize in Literature 1901 "in special recognition of his poetic composit... 1/1 Individual Sully Prudhomme 1839-03-16 Paris France France Male NaN NaN NaN FRA 1.00
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its... 1/1 Individual Emil Adolf von Behring 1854-03-15 Hansdorf (Lawice) Prussia (Poland) Poland Male Marburg University Marburg Germany POL 1.00
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Frédéric Passy 1822-05-20 Paris France France Male NaN NaN NaN FRA 0.50
4 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Jean Henry Dunant 1828-05-08 Geneva Switzerland Switzerland Male NaN NaN NaN CHE 0.50
In [53]:
top20_birth_city = df_data.groupby(by = 'birth_city',
                                as_index=False).agg({'prize':pd.Series.count})
# # top20_countries
top20_birth_city = top20_birth_city.sort_values('prize')[-20:]#to plot correctly
# plotly creates its own figure, so no matplotlib figure is needed here
fig = px.bar(top20_birth_city,
             x='prize',
             y='birth_city',
             orientation='h',
             color='prize',
             color_continuous_scale='Plasma',
             title='Top 20 Birth Cities by Number of Prizes')

fig.update_layout(xaxis_title='Number of Prizes',
                  yaxis_title='Birth City',
                  coloraxis_showscale=False)

fig.show()

Plotly Sunburst Chart: Combine Country, City, and Organisation¶

  • Creating a DataFrame that groups the number of prizes by organisation.
In [54]:
prizes_Per_org_cntry = df_data.groupby(by=['organization_country', 
                                       'organization_city', 
                                       'organization_name'],
                                as_index=False).agg({'prize':pd.Series.count})

prizes_Per_org_cntry = prizes_Per_org_cntry.sort_values('prize',ascending=False)#to plot correctly
prizes_Per_org_cntry
Out[54]:
organization_country organization_city organization_name prize
205 United States of America Cambridge, MA Harvard University 29
280 United States of America Stanford, CA Stanford University 23
206 United States of America Cambridge, MA Massachusetts Institute of Technology (MIT) 21
209 United States of America Chicago, IL University of Chicago 20
195 United States of America Berkeley, CA University of California 19
... ... ... ... ...
110 Japan Sapporo Hokkaido University 1
111 Japan Tokyo Asahi Kasei Corporation 1
112 Japan Tokyo Kitasato University 1
113 Japan Tokyo Tokyo Institute of Technology 1
290 United States of America Yorktown Heights, NY IBM Thomas J. Watson Research Center 1

291 rows × 4 columns

In [55]:
fig = px.sunburst(prizes_Per_org_cntry, path=['organization_country', 
                                       'organization_city', 
                                       'organization_name'],
                  values='prize',
                  color='prize', hover_data=['prize'],
                  color_continuous_scale='RdBu',
                  color_continuous_midpoint=np.average(prizes_Per_org_cntry['prize'],
                                                       weights=prizes_Per_org_cntry['prize']))
fig.show()
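
The colour midpoint in the sunburst call is a weighted average of the prize counts, using the counts themselves as weights, which pulls the midpoint towards the organisations with many prizes. A minimal sketch of the difference from a plain mean, assuming made-up counts (the `counts` array is illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical prize counts for four organisations (not real data)
counts = np.array([1, 1, 2, 6])

# Plain mean: every organisation contributes equally
plain_mean = counts.mean()

# Self-weighted mean: sum(c * c) / sum(c), dominated by the large counts
weighted_mean = np.average(counts, weights=counts)

print(plain_mean, weighted_mean)
```

With a long tail of single-prize organisations, the self-weighted midpoint sits well above the plain mean, so only the largest organisations land on the upper half of the colour scale.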

Patterns in the Laureate Age at the Time of the Award¶

How Old Are the Laureates When They Win the Prize?

In [56]:
df_data.head()
Out[56]:
year category prize motivation prize_share laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO share_pct
0 1901 Chemistry The Nobel Prize in Chemistry 1901 "in recognition of the extraordinary services ... 1/1 Individual Jacobus Henricus van 't Hoff 1852-08-30 Rotterdam Netherlands Netherlands Male Berlin University Berlin Germany NLD 1.00
1 1901 Literature The Nobel Prize in Literature 1901 "in special recognition of his poetic composit... 1/1 Individual Sully Prudhomme 1839-03-16 Paris France France Male NaN NaN NaN FRA 1.00
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its... 1/1 Individual Emil Adolf von Behring 1854-03-15 Hansdorf (Lawice) Prussia (Poland) Poland Male Marburg University Marburg Germany POL 1.00
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Frédéric Passy 1822-05-20 Paris France France Male NaN NaN NaN FRA 0.50
4 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Jean Henry Dunant 1828-05-08 Geneva Switzerland Switzerland Male NaN NaN NaN CHE 0.50
In [57]:
# Approximate the age at the award as award year minus birth year
birth_year = df_data.birth_date.dt.year
df_data['winning_age'] = df_data.year - birth_year
df_data
Out[57]:
year category prize motivation prize_share laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO share_pct winning_age
0 1901 Chemistry The Nobel Prize in Chemistry 1901 "in recognition of the extraordinary services ... 1/1 Individual Jacobus Henricus van 't Hoff 1852-08-30 Rotterdam Netherlands Netherlands Male Berlin University Berlin Germany NLD 1.00 49.00
1 1901 Literature The Nobel Prize in Literature 1901 "in special recognition of his poetic composit... 1/1 Individual Sully Prudhomme 1839-03-16 Paris France France Male NaN NaN NaN FRA 1.00 62.00
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its... 1/1 Individual Emil Adolf von Behring 1854-03-15 Hansdorf (Lawice) Prussia (Poland) Poland Male Marburg University Marburg Germany POL 1.00 47.00
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Frédéric Passy 1822-05-20 Paris France France Male NaN NaN NaN FRA 0.50 79.00
4 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Jean Henry Dunant 1828-05-08 Geneva Switzerland Switzerland Male NaN NaN NaN CHE 0.50 73.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
957 2020 Medicine The Nobel Prize in Physiology or Medicine 2020 “for the discovery of Hepatitis C virus” 1/3 Individual Michael Houghton 1949-07-02 NaN United Kingdom United Kingdom Male University of Alberta Edmonton Canada GBR 0.33 71.00
958 2020 Peace The Nobel Peace Prize 2020 “for its efforts to combat hunger, for its con... 1/1 Organization World Food Programme (WFP) NaT NaN NaN NaN NaN NaN NaN NaN NaN 1.00 NaN
959 2020 Physics The Nobel Prize in Physics 2020 “for the discovery of a supermassive compact o... 1/4 Individual Andrea Ghez 1965-06-16 New York, NY United States of America United States of America Female University of California Berkeley, CA United States of America USA 0.25 55.00
960 2020 Physics The Nobel Prize in Physics 2020 “for the discovery of a supermassive compact o... 1/4 Individual Reinhard Genzel 1952-03-24 Bad Homburg vor der Höhe Germany Germany Male University of California Los Angeles, CA United States of America DEU 0.25 68.00
961 2020 Physics The Nobel Prize in Physics 2020 “for the discovery that black hole formation i... 1/2 Individual Roger Penrose 1931-08-08 Colchester United Kingdom United Kingdom Male University of Oxford Oxford United Kingdom GBR 0.50 89.00

962 rows × 18 columns
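
The NaN age for the World Food Programme row follows from its missing birth date. A small sketch of that behaviour, assuming a toy frame (`toy` and its three rows are illustrative; only the column names mirror the notebook):

```python
import pandas as pd

# Toy frame with the two relevant columns; the last row has no birth date,
# as happens for organisations
toy = pd.DataFrame({
    'year': [1901, 2014, 2020],
    'birth_date': pd.to_datetime(['1852-08-30', '1997-07-12', None]),
})

# .dt.year returns NaN for missing dates (NaT), so the subtraction
# propagates NaN instead of raising an error
toy['winning_age'] = toy['year'] - toy['birth_date'].dt.year

print(toy)
```

Because the result is the difference of two calendar years, it can be off by one from the laureate's exact age on the day of the award.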

In [58]:
df_data.describe()
Out[58]:
          year  share_pct  winning_age
count   962.00     962.00       934.00
mean  1,971.82       0.63        59.95
std      33.81       0.29        12.62
min   1,901.00       0.25        17.00
25%   1,948.00       0.33        51.00
50%   1,977.00       0.50        60.00
75%   2,001.00       1.00        69.00
max   2,020.00       1.00        97.00

Who were the oldest and youngest winners?¶

  • What are the names of the youngest and oldest Nobel laureates?
  • What did they win the prize for?
  • What is the average age of a winner?
In [59]:
print('oldest winner')
display(df_data.nlargest(n=1, columns='winning_age'))
print('youngest winner')
display(df_data.nsmallest(n=1, columns='winning_age'))
oldest winner
year category prize motivation prize_share laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO share_pct winning_age
937 2019 Chemistry The Nobel Prize in Chemistry 2019 “for the development of lithium-ion batteries” 1/3 Individual John Goodenough 1922-07-25 Jena Germany Germany Male University of Texas Austin TX United States of America DEU 0.33 97.00
youngest winner
year category prize motivation prize_share laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO share_pct winning_age
885 2014 Peace The Nobel Peace Prize 2014 "for their struggle against the suppression of... 1/2 Individual Malala Yousafzai 1997-07-12 Mingora Pakistan Pakistan Female NaN NaN NaN PAK 0.50 17.00
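
When only the names and the average are needed, rather than the full rows, `idxmax()`, `idxmin()` and `mean()` are one possible route. A minimal sketch, assuming a toy frame with placeholder names (`toy`, 'Laureate A' and so on are hypothetical):

```python
import pandas as pd

# Placeholder laureates and ages, for illustration only
toy = pd.DataFrame({
    'full_name': ['Laureate A', 'Laureate B', 'Laureate C'],
    'winning_age': [49.0, 17.0, 97.0],
})

# idxmax/idxmin return the index label of the extreme value (NaN is skipped)
oldest = toy.loc[toy['winning_age'].idxmax(), 'full_name']
youngest = toy.loc[toy['winning_age'].idxmin(), 'full_name']

# mean() also skips NaN by default
average_age = toy['winning_age'].mean()

print(oldest, youngest, round(average_age, 2))
```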

Descriptive Statistics for the Laureate Age at Time of Award¶

  • Calculating the descriptive statistics for the age at the time of the award.
In [60]:
df_data.describe()
Out[60]:
          year  share_pct  winning_age
count   962.00     962.00       934.00
mean  1,971.82       0.63        59.95
std      33.81       0.29        12.62
min   1,901.00       0.25        17.00
25%   1,948.00       0.33        51.00
50%   1,977.00       0.50        60.00
75%   2,001.00       1.00        69.00
max   2,020.00       1.00        97.00
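
To restrict the summary to the age alone, `describe()` can also be called on a single column. A short sketch on made-up ages (`toy_ages` is illustrative, not the notebook's data):

```python
import pandas as pd

# Five made-up ages, for illustration only
toy_ages = pd.Series([17, 49, 60, 62, 97], name='winning_age')

# describe() on a Series returns count, mean, std, min, quartiles and max
stats = toy_ages.describe()

print(stats)
```

On the notebook's data the equivalent call would presumably be `df_data.winning_age.describe()`.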
In [61]:
plt.figure(figsize=(8, 4), dpi=200)
sns.histplot(data=df_data,
             x='winning_age',
             bins=30)
plt.xlabel('Age')
plt.title('Distribution of Age on Receipt of Prize')
plt.show()

Age at Time of Award throughout History¶

Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?

  • According to the best fit line, how old were Nobel laureates in the years 1900-1940 when they were awarded the prize?
  • According to the best fit line, what age would it predict for a Nobel laureate in 2020?

The histogram above shows us the distribution across the entire dataset, over the entire time period. But perhaps the age has changed over time.

In [62]:
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.regplot(data=df_data,
                x='year',
                y='winning_age',
                lowess=True,  # locally weighted smoothing (requires statsmodels)
                scatter_kws = {'alpha': 0.4},
                line_kws={'color': 'black'})
 
plt.show()
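
The chart above uses a lowess smoother, so ages have to be read off the curve by eye. To get a number for a given year, one alternative is a straight-line fit with `np.polyfit`. This is a swapped-in technique rather than what the chart draws, and the sketch below assumes synthetic points (`years`, `ages` and `predict_age` are hypothetical):

```python
import numpy as np

# Synthetic (year, age) pairs lying exactly on a line; not real laureate data
years = np.array([1900, 1940, 1980, 2020])
ages = np.array([55.0, 58.0, 61.0, 64.0])

# Degree-1 least-squares fit returns (slope, intercept)
slope, intercept = np.polyfit(years, ages, deg=1)


def predict_age(year):
    """Age predicted by the fitted line for a given award year."""
    return slope * year + intercept


print(round(predict_age(2020), 1))
```

On the real data, rows with a missing `winning_age` would need to be dropped before fitting.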

Winning Age Across the Nobel Prize Categories¶

How does the age of laureates vary by category?

  • Which category has the longest "whiskers"?
  • In which prize category are the average winners the oldest?
  • In which prize category are the average winners the youngest?
In [63]:
df_data.head()
Out[63]:
year category prize motivation prize_share laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO share_pct winning_age
0 1901 Chemistry The Nobel Prize in Chemistry 1901 "in recognition of the extraordinary services ... 1/1 Individual Jacobus Henricus van 't Hoff 1852-08-30 Rotterdam Netherlands Netherlands Male Berlin University Berlin Germany NLD 1.00 49.00
1 1901 Literature The Nobel Prize in Literature 1901 "in special recognition of his poetic composit... 1/1 Individual Sully Prudhomme 1839-03-16 Paris France France Male NaN NaN NaN FRA 1.00 62.00
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its... 1/1 Individual Emil Adolf von Behring 1854-03-15 Hansdorf (Lawice) Prussia (Poland) Poland Male Marburg University Marburg Germany POL 1.00 47.00
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Frédéric Passy 1822-05-20 Paris France France Male NaN NaN NaN FRA 0.50 79.00
4 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 Individual Jean Henry Dunant 1828-05-08 Geneva Switzerland Switzerland Male NaN NaN NaN CHE 0.50 73.00
In [64]:
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.boxplot(data=df_data,
                x='category',
                y='winning_age')
 
plt.show()
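
The box plot answers these questions visually. The same comparison can be made numerically by grouping on the category. A minimal sketch, assuming a toy frame (`toy` and its values are illustrative; only the column names follow the notebook):

```python
import pandas as pd

# Made-up ages for three categories, for illustration only
toy = pd.DataFrame({
    'category': ['Physics', 'Physics', 'Literature', 'Literature', 'Peace'],
    'winning_age': [40.0, 50.0, 60.0, 70.0, 55.0],
})

# Mean and median age per category, oldest category first
summary = (toy.groupby('category')['winning_age']
              .agg(['mean', 'median'])
              .sort_values('mean', ascending=False))

print(summary)
```

The median corresponds to the line inside each box, while the mean is the "average" the questions above refer to.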
  • What are the winning age trends in each category?
  • Which category has the age trending up and which category has the age trending down?
  • Is this .lmplot() telling a different story from the .boxplot()?
  • Creating another chart with Seaborn. This time, .lmplot() uses the row parameter to draw a separate panel for each of the 6 categories.
In [65]:
with sns.axes_style('whitegrid'):
    sns.lmplot(data=df_data,
               x='year', 
               y='winning_age',
               row = 'category',
               lowess=True, 
               aspect=2,
               scatter_kws = {'alpha': 0.6},
               line_kws = {'color': 'black'})
 
plt.show()

Combining all these charts into the same chart, using the hue parameter to colour each category.

In [66]:
with sns.axes_style("whitegrid"):
    sns.lmplot(data=df_data,
               x='year',
               y='winning_age',
               hue='category',
               lowess=True, 
               aspect=2,
               scatter_kws={'alpha': 0.5},
               line_kws={'linewidth': 5})
 
plt.show()
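
Whether a category is trending up or down can also be summarised with one number per category. The sketch below fits a straight line per category with `np.polyfit`, which is a simplification of the lowess curves in the charts, and assumes a toy frame (`toy` and `trend_per_decade` are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up data: one category getting older over time, one getting younger
toy = pd.DataFrame({
    'category': ['Physics'] * 3 + ['Peace'] * 3,
    'year': [1900, 1960, 2020] * 2,
    'winning_age': [40.0, 50.0, 60.0, 65.0, 60.0, 55.0],
})


def trend_per_decade(group):
    """Slope of a least-squares line through (year, age), in years per decade."""
    slope, _ = np.polyfit(group['year'].to_numpy(),
                          group['winning_age'].to_numpy(), deg=1)
    return slope * 10


# Positive values mean laureates are getting older over time
trends = {name: trend_per_decade(group)
          for name, group in toy.groupby('category')}

print(trends)
```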